Credit Card Users Churn PredictionΒΆ

Problem StatementΒΆ

Business ContextΒΆ

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas

You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards

Data DescriptionΒΆ

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank (in months)
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio

What Is a Revolving Balance?ΒΆ

  • If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance
What is the Average Open to buy?ΒΆ
  • 'Open to Buy' means the amount left on your credit card to use. Now, this column represents the average of this value for the last 12 months.
What is the Average utilization Ratio?ΒΆ
  • The Avg_Utilization_Ratio represents how much of the available credit the customer spent. This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:ΒΆ
  • ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1

Please read the instructions carefully before starting the project.ΒΆ

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

  • Blanks '_______' are provided in the notebook that

needs to be filled with an appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in the blank space.

  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill the code wherever asked by the commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw error.
  • Please run the codes in a sequential manner from the beginning to avoid any unnecessary errors.
  • Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.

Importing necessary librariesΒΆ

InΒ [7]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
InΒ [8]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn xgboost==2.0.3 -q --user
!pip install --upgrade -q threadpoolctl

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

InΒ [10]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

# from sklearn.tree import DecisionTreeRegressor
# from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import tree # for DecisionTreeClassifier

# To tune model, get different metric scores, and split data
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay
)

# To impute missing values
from sklearn.impute import SimpleImputer
from sklearn import metrics

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the datasetΒΆ

InΒ [12]:
#Loading dataset
data=pd.read_csv("BankChurners.csv")

Data OverviewΒΆ

  • Observations
  • Sanity checks

View first five rows of the datasetΒΆ

InΒ [16]:
data.head()
Out[16]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000

Check data types and number for non-null values from each columnΒΆ

InΒ [18]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Observation

  • There are 20 columns with 10127 rows of data.
  • Marital_Status column and Educational_Level column have missing data
  • There are several object types:
    • Attribution_Flag, Gender, Educational_Level, Marital_Status, Income_Category, Card_Category
  • CLIENTNUM represents a key and is not numerically significant

Next step: The number of non-null values of each column is equal to the number of total rows in the dataset i.e. no null value. We can further confirm this using isna() method.

InΒ [20]:
data.isna().sum()
Out[20]:
CLIENTNUM                      0
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Observation

As expected:

  • Educational_Level is missing about 15% of the rows
  • Marital_Status is missing is about 7% of the rows

Summary of the datasetΒΆ

InΒ [23]:
# Summary of the continuous columns
data[['Customer_Age','Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']].describe().T
Out[23]:
count mean std min 25% 50% 75% max
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Dependent_count 10127.000 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999

Observations

  • Mean and median value for Customer_Age is approx 46

  • Mean and median value for Months_on_book is approx 36

  • Mean and median value for Dependent_count is approx 2, although there are outliers

  • Mean and median value for Total_Relationship_Count is approx 4, although there are outliers

  • Outliers may be significant in Months_Inactive_12_mon, Contacts_Count_12_mon

  • Total_Revolving_Bal, Total_Amt_Chng_Q4_Q1, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1 all have many outliers

  • Right skewed data (Mean > Median):

    • Credit_Limit where the mean is 8.6K and the median is 4.5K with outliers
    • Avg_Open_To_Buy where the mean is 7.4K and the median is 3.4K with outliers
    • Total_Trans_Amt where the mean is 4.4K and the median is 3.9K with many outliers
    • Avg_Utilization_Ratio where the mean is .27 and the median is .18

Number of unique columnsΒΆ

InΒ [26]:
data.nunique()
Out[26]:
CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 6
Marital_Status                  3
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64

Observations

  • Can drop the CLIENTNUM column as it is an ID variable and will not add value to the model
InΒ [28]:
# Dropping columns from the dataframe
data.drop(columns=['CLIENTNUM'], inplace=True)

Number of observations in each categoryΒΆ

InΒ [30]:
cat_cols=['Attrition_Flag','Gender','Education_Level','Marital_Status','Income_Category','Card_Category']

for column in cat_cols:
    print(data[column].value_counts())
    print('-'*30)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
------------------------------
F    5358
M    4769
Name: Gender, dtype: int64
------------------------------
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
------------------------------
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
------------------------------

Observations

  • Attrited Customer represents 19% of the customer base
  • 47% of the customer base are males
  • There are six catagories for Education_Level
  • There are three categories for Marital_Status
  • There is a nonsense category in Income_Category with abc as data that represents about 11% of customer data
  • There are very few Platinum customers

Make a copy of the datasetΒΆ

InΒ [33]:
df = data.copy()

# `df` is not used in this notebook, but is available for future use 

Exploratory Data Analysis (EDA)ΒΆ

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. How is the total transaction amount distributed?
  2. What is the distribution of the level of education of customers?
  3. What is the distribution of the level of income of customers?
  4. How does the change in transaction amount between Q4 and Q1 (total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
  5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
  6. What are the attributes that have a strong correlation with each other?

The below functions need to be defined to carry out the Exploratory Data Analysis.ΒΆ

InΒ [38]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
InΒ [39]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
InΒ [40]:
# function to plot stacked bar chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
InΒ [41]:
### Function to plot distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()
InΒ [42]:
### Univariate analysis

Compare the Attrition_FlagΒΆ

Before going into the question, let's view the size of the attrition values.

InΒ [45]:
labeled_barplot(data,'Attrition_Flag',perc=True)
No description has been provided for this image

How is the total transaction amount distributed?ΒΆ

InΒ [47]:
histogram_boxplot(data, 'Total_Trans_Amt')
No description has been provided for this image

Observations

  • The data has four identifiable peaks
  • The data is right skewed

What is the distribution of the level of education of customers?ΒΆ

InΒ [50]:
labeled_barplot(data,'Education_Level',perc=True)
No description has been provided for this image

Observations

  • The largest number is Graduate
  • NOTE: There was missing data in education category

What is the distribution of the level of income of customers?ΒΆ

InΒ [53]:
labeled_barplot(data,'Income_Category',perc=True)
No description has been provided for this image

Observations

  • A significant number have incomes less than $40K
  • abc is a nonsense field
  • The 40K - 60K, 60K - 80K, and 80K-120K have around the same values, roughly half the number of less than $40K

Bivariate analysisΒΆ

How does the change in transaction amount between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?ΒΆ

InΒ [57]:
sns.set(rc={'figure.figsize':(21,7)})
sns.catplot(x="Attrition_Flag", y="Total_Ct_Chng_Q4_Q1", kind="boxen", data=data, height=7);
No description has been provided for this image
InΒ [58]:
sns.swarmplot(data, x="Total_Ct_Chng_Q4_Q1", y="Attrition_Flag")
Out[58]:
<Axes: xlabel='Total_Ct_Chng_Q4_Q1', ylabel='Attrition_Flag'>
No description has been provided for this image
InΒ [59]:
sns.catplot(data, x="Total_Ct_Chng_Q4_Q1", y="Attrition_Flag", kind="violin")
Out[59]:
<seaborn.axisgrid.FacetGrid at 0x2202f2d6310>
No description has been provided for this image
InΒ [60]:
sns.histplot(data, x="Total_Ct_Chng_Q4_Q1", hue="Attrition_Flag", multiple="dodge", bins=30)
Out[60]:
<Axes: xlabel='Total_Ct_Chng_Q4_Q1', ylabel='Count'>
No description has been provided for this image

Observation

As the charts show, Attrited Customer has lower Total_Ct_Chng_Q4_Q1.

  • The count is lower for Attrited Customer
  • The mean is lower for Attrited Customer

How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?ΒΆ

InΒ [63]:
sns.set(rc={'figure.figsize':(21,7)})
sns.catplot(x="Attrition_Flag", y="Months_Inactive_12_mon", kind="boxen", data=data, height=7);
No description has been provided for this image
InΒ [64]:
sns.histplot(data, x="Months_Inactive_12_mon", hue="Attrition_Flag", multiple="dodge")
Out[64]:
<Axes: xlabel='Months_Inactive_12_mon', ylabel='Count'>
No description has been provided for this image

What are the attributes that have a strong correlation with each other?ΒΆ

InΒ [66]:
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(data.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="Spectral")
plt.show()
No description has been provided for this image

Observations

Positive correlations

Type Values Column Value Description
Strong positive correlation r > 0.75 Average_Open_To_Buy and Credit_Limit 1.0 Essentially two ways of saying the same thing ‐ that you have more credit available to you
Strong positive correlation r > 0.75 Total_Trans_Amt and Total_Trans_Ct .81 The total amounts and counts are highly correlated
Strong positive correlation r > 0.75 Months_on_book and Customer_Age .79 The older the customer gets, the more likely they are to be a customer longer. (Another way of staying younger people have newer accounts )

Moderate positive relationship| r between 0. and 0.7 | Total_Revolving_Bal and Avg_Utilization_Ratio| .62| (A higher balance lends itself to a higher ratio of use |)| Weak positive relationship| between 0 and 0.2 | Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1| .38| (Changes in quarterly count are weakly correlated to changes in amount, at least in the same directio |))

Negative corrlations

Type Values Column Value Description
Weak negative relationship between 0 and -.25 Total_Relationship_Count and Total_Trans_Ct -.24 Larger families may mean lesser total transaction count (Similar to the next item)
Weak negative relationship between 0 and -.25 Total_Relationship_Count and Total_Trans_Amt -.35 Larger families may mean lesser total transaction amount
Moderate negative relationship between -.50 and -.75 Credit_Limit and Avg_Utilization_Ratio -.48 Higher credit limit generally means you don't need to use as much in proportion to that limit
Moderate negative relationship between -.50 and -.75 Avg_Open_To_Buy and Avg_Utilization_Ratio -.54 Higher amount you can buy genearlly means you don't need to use as much in proportion to your limit

Additional analysisΒΆ

The scatter plot matrix can help us visually identify significant patterns and groupings in the data.

InΒ [71]:
# Scatter plot matrix
#num_features = ['Credit_Limit', 'Total_Trans_Amt', 'Months_on_book','Total_Ct_Chng_Q4_Q1']
num_features = ['Customer_Age','Dependent_count','Months_on_book','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon','Credit_Limit','Total_Revolving_Bal','Avg_Open_To_Buy','Total_Amt_Chng_Q4_Q1','Total_Trans_Amt','Total_Trans_Ct','Total_Ct_Chng_Q4_Q1','Avg_Utilization_Ratio']
plt.figure(figsize=(12, 8))
sns.pairplot(data, vars=num_features, hue='Attrition_Flag', diag_kind='kde');
<Figure size 1200x800 with 0 Axes>
No description has been provided for this image

Observations

Key observable grouping of Attrited and Existing customers can be seen in the charts. In particular:

  • Total_Trans_Amt and Credit_Limit
  • Months_on_book and Total_Trans_Amt
  • In every feature with Total_Trans_Ct and Total_Trans_Amtand Total_Ct_Chng_Q4_Q1

Lesser correlation can be seen:

  • Months_on_book and Credit_Limit

Also in many features along the diagonal

  • Attrited and Existing customers have generally the same skewness

Data Pre-processingΒΆ

Outlier detection and treatmentΒΆ

InΒ [75]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
No description has been provided for this image

Observations

  • There are quite a few outliers in the data
  • The data generally appear to be proper values

Remove index columnΒΆ

InΒ [78]:
# Completed in an earlier step
# data.drop(columns=['CLIENTNUM'], inplace=True)

Removing duplicate columnsΒΆ

InΒ [80]:
# Remove Avg open to buy as it is a duplicate of Credit Limit
data.drop(columns=['Avg_Open_To_Buy'], inplace=True)

Get a lists of the columns that are object type to convert into numeric values

Do any columns remain without values?

InΒ [83]:
data.isna().sum()
Out[83]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Fix Income_Category nonsense for abc dataΒΆ

Replace the abc values with the missing value and we can impute in a following step

InΒ [86]:
data['Income_Category'].value_counts()
Out[86]:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
InΒ [87]:
data['Income_Category'] = data['Income_Category'].replace(['abc'], '$80K - $120K')
InΒ [88]:
# sanity check for Income Category replacement
print(data['Income_Category'].value_counts())
Less than $40K    3561
$80K - $120K      2647
$40K - $60K       1790
$60K - $80K       1402
$120K +            727
Name: Income_Category, dtype: int64
InΒ [89]:
# sanity check for Income Category replacement
print(data['Marital_Status'].value_counts())
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64

Feature engineeringΒΆ

Pass numerical values for each categorical column for imputation so we will label encode them

These following categories do have missing data.

For imputed data to be created, the categories are changed into numerical data.

InΒ [94]:
marital_status = {
    'Married' : 0,
    'Single' : 1,
    'Divorced' : 2
}

data["Marital_Status"] = data["Marital_Status"].map(marital_status)

education_level = {
    'Uneducated' : 0,
    'High School': 1,
    'Graduate': 2,
    'College': 3,
    'Post-Graduate': 4,
    'Doctorate': 5     
}
data["Education_Level"] = data["Education_Level"].map(education_level)

SummaryΒΆ

InΒ [96]:
data.head()
Out[96]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 Existing Customer 45 M 3 1.000 0.000 $60K - $80K Blue 39 5 1 3 12691.000 777 1.335 1144 42 1.625 0.061
1 Existing Customer 49 F 5 2.000 1.000 Less than $40K Blue 44 6 1 2 8256.000 864 1.541 1291 33 3.714 0.105
2 Existing Customer 51 M 3 2.000 0.000 $80K - $120K Blue 36 4 1 0 3418.000 0 2.594 1887 20 2.333 0.000
3 Existing Customer 40 F 4 1.000 NaN Less than $40K Blue 34 3 4 1 3313.000 2517 1.405 1171 20 2.333 0.760
4 Existing Customer 40 M 3 0.000 0.000 $60K - $80K Blue 21 5 1 0 4716.000 0 2.175 816 28 2.500 0.000

The data in the missing columns have been converted into numerics, so they can be imputed.

Data preparation for model buildingΒΆ

Separate features from the target columnΒΆ

The target column is Attrition_Flag

It has already been converted to numeric

InΒ [101]:
# Separating features and the target column
X = data.drop('Attrition_Flag', axis=1)
y = data['Attrition_Flag'].apply(lambda x: 1 if x == "Attrited Customer" else 0)

Split the data into train and test setsΒΆ

InΒ [103]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
InΒ [104]:
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 18) (2026, 18) (2026, 18)
InΒ [105]:
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075
Number of rows in validation data = 2026
Number of rows in test data = 2026

Missing value treatmentΒΆ

InΒ [107]:
df.isnull().sum()
Out[107]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64

Fix Education_Level and Marital_Status missing valuesΒΆ

InΒ [109]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Get list of categorical and numerical columns
cat_cols = list(X_train.select_dtypes(include='object').columns)
num_cols = list(X_train.select_dtypes(include=['int', 'float']).columns)

# Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_val[cat_cols] = cat_imputer.transform(X_val[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])

# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_val[num_cols] = num_imputer.transform(X_val[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
InΒ [110]:
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

Reverse Mapping for Encoded VariablesΒΆ

InΒ [112]:
## Function to inverse the encoding
def inverse_mapping(x, y):
    inv_dict = {v: k for k, v in x.items()}
    X_train[y] = np.round(X_train[y]).map(inv_dict).astype("category")
    X_val[y] = np.round(X_val[y]).map(inv_dict).astype("category")
    X_test[y] = np.round(X_test[y]).map(inv_dict).astype("category")
InΒ [113]:
inverse_mapping(marital_status, "Marital_Status")
inverse_mapping(education_level, "Education_Level")

The mappings replace the numberic values with category values

Train DatasetΒΆ

InΒ [116]:
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 30)
F    3193
M    2882
Name: Gender, dtype: int64
******************************
Graduate         2782
High School      1228
Uneducated        881
College           618
Post-Graduate     312
Doctorate         254
Name: Education_Level, dtype: int64
******************************
Single      2826
Married     2819
Divorced     430
Name: Marital_Status, dtype: int64
******************************
Less than $40K    2129
$80K - $120K      1607
$40K - $60K       1059
$60K - $80K        831
$120K +            449
Name: Income_Category, dtype: int64
******************************
Blue        5655
Silver       339
Gold          69
Platinum      12
Name: Card_Category, dtype: int64
******************************

Validation DatasetΒΆ

InΒ [118]:
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 30)
F    1095
M     931
Name: Gender, dtype: int64
******************************
Graduate         917
High School      404
Uneducated       306
College          199
Post-Graduate    101
Doctorate         99
Name: Education_Level, dtype: int64
******************************
Married     960
Single      910
Divorced    156
Name: Marital_Status, dtype: int64
******************************
Less than $40K    736
$80K - $120K      514
$40K - $60K       361
$60K - $80K       279
$120K +           136
Name: Income_Category, dtype: int64
******************************
Blue        1905
Silver        97
Gold          21
Platinum       3
Name: Card_Category, dtype: int64
******************************

Test DatasetΒΆ

InΒ [120]:
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 30)
F    1070
M     956
Name: Gender, dtype: int64
******************************
Graduate         948
High School      381
Uneducated       300
College          196
Post-Graduate    103
Doctorate         98
Name: Education_Level, dtype: int64
******************************
Single      956
Married     908
Divorced    162
Name: Marital_Status, dtype: int64
******************************
Less than $40K    696
$80K - $120K      526
$40K - $60K       370
$60K - $80K       292
$120K +           142
Name: Income_Category, dtype: int64
******************************
Blue        1876
Silver       119
Gold          26
Platinum       5
Name: Card_Category, dtype: int64
******************************

Creating Dummy VariablesΒΆ

InΒ [122]:
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 28) (2026, 28) (2026, 28)

Observation

There are now 28 columns after the mapping There were originally 20 columns (including the target)

Model BuildingΒΆ

Model evaluation criterionΒΆ

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

InΒ [128]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1

        },
        index=[0],
    )

    return df_perf
InΒ [129]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building with original dataΒΆ

Sample code for model building with original data

InΒ [132]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
'_______' ## Complete the code to append remaining 3 models in the list models
models.append(("Decision tree", tree.DecisionTreeClassifier(random_state=1, class_weight='balanced')))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GradientBoost", GradientBoostingClassifier(random_state=1)))
InΒ [133]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_train = recall_score(y_train, model.predict(X_train))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference1 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference1))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9857, Validation Score: 0.8160, Difference: 0.1697
Random forest: Training Score: 1.0000, Validation Score: 0.8067, Difference: 0.1933
Decision tree: Training Score: 1.0000, Validation Score: 0.8190, Difference: 0.1810
AdaBoost: Training Score: 0.8453, Validation Score: 0.8589, Difference: -0.0136
GradientBoost: Training Score: 0.8760, Validation Score: 0.8681, Difference: 0.0079

Observation

The best model with original data is GradientBoost with .0079 difference between Training and Validation. It also has good scores for accuracy on both Training and Validation at 87%.

Model Building with Oversampled dataΒΆ

InΒ [136]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099 

InΒ [137]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
InΒ [138]:
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))


print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099 

After Oversampling, the shape of train_X: (10198, 28)
After Oversampling, the shape of train_y: (10198,) 

InΒ [139]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GradientBoost", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Decision tree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
InΒ [140]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_train = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference2 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9976, Validation Score: 0.8558, Difference: 0.1418
Random forest: Training Score: 1.0000, Validation Score: 0.8374, Difference: 0.1626
GradientBoost: Training Score: 0.9820, Validation Score: 0.8926, Difference: 0.0893
Adaboost: Training Score: 0.9651, Validation Score: 0.8681, Difference: 0.0970
Decision tree: Training Score: 1.0000, Validation Score: 0.8098, Difference: 0.1902

Observations

GradientBoost has the best performance with oversampled data and has better accuracy than with the original data.

Model Building with Undersampled dataΒΆ

InΒ [143]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
InΒ [144]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099 

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976 

After Under Sampling, the shape of train_X: (1952, 28)
After Under Sampling, the shape of train_y: (1952,) 

InΒ [145]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GradientBoost", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Decision tree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
InΒ [146]:
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_train = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference3 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9928, Validation Score: 0.9141, Difference: 0.0787
Random forest: Training Score: 1.0000, Validation Score: 0.9294, Difference: 0.0706
GradientBoost: Training Score: 0.9795, Validation Score: 0.9632, Difference: 0.0163
Adaboost: Training Score: 0.9539, Validation Score: 0.9693, Difference: -0.0154
Decision tree: Training Score: 1.0000, Validation Score: 0.9018, Difference: 0.0982

Observations

Adaboost and GBM have similar results when under sampling. GBM has a better Training score and comparable Validation Score.re

Overall observationsΒΆ

After building the models, it was observed that both the GBM and Adaboost models, trained on an undersampled dataset, as well as the GBM model trained on an oversampled dataset, exhibited strong performance on both the training and validation datasets.

Sometimes models might overfit after undersampling and oversampling, so it's better to tune the models to get a generalized performance

We will tune the four best models using the same data (original or undersampled or oversampled) as we trained them on before:

  • Original: GradientBoost: Training Score: 0.8760, Validation Score: 0.8681, Difference: 0.0079
  • Oversampled: GradientBoost: Training Score: 0.9820, Validation Score: 0.8926, Difference: 0.0893
  • Undersampled: GradientBoost: Training Score: 0.9795, Validation Score: 0.9632, Difference: 0.0163
  • Undersampled: Adaboost: Training Score: 0.9539, Validation Score: 0.9693, Difference: -0.0154

HyperparameterTuningΒΆ

Sample Parameter GridsΒΆ

Note

  1. Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
  • Please note that if the parameter grid is extended to improve the model performance further, the execution time will increase
  • For Gradient Boosting:
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
  • For Adaboost:
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
  • For Bagging Classifier:
param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
  • For Random Forest:
param_grid = {
    "n_estimators": [50,110,25],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}
  • For Decision Trees:
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
  • For XGBoost (optional):
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}
InΒ [154]:
'''
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
'''
Out[154]:
'\n# defining model\nModel = DecisionTreeClassifier(random_state=1)\n\n# Parameter grid to pass in RandomSearchCV\nparam_grid = {\'max_depth\': np.arange(2,6),\n              \'min_samples_leaf\': [1, 4, 7],\n              \'max_leaf_nodes\' : [10,15],\n              \'min_impurity_decrease\': [0.0001,0.001] }\n\n#Calling RandomizedSearchCV\nrandomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)\n\n#Fitting parameters in RandomizedSearchCV\nrandomized_cv.fit(X_train,y_train)\n\nprint("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))\n'
InΒ [155]:
'''
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
'''
Out[155]:
'\n# defining model\nModel = DecisionTreeClassifier(random_state=1)\n\n# Parameter grid to pass in RandomSearchCV\nparam_grid = {\'max_depth\': np.arange(2,6),\n              \'min_samples_leaf\': [1, 4, 7],\n              \'max_leaf_nodes\' : [10,15],\n              \'min_impurity_decrease\': [0.0001,0.001] }\n\n#Calling RandomizedSearchCV\nrandomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)\n\n#Fitting parameters in RandomizedSearchCV\nrandomized_cv.fit(X_train_over,y_train_over)\n\nprint("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))\n'
InΒ [156]:
'''# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
'''
Out[156]:
'# defining model\nModel = DecisionTreeClassifier(random_state=1)\n\n# Parameter grid to pass in RandomSearchCV\nparam_grid = {\'max_depth\': np.arange(2,6),\n              \'min_samples_leaf\': [1, 4, 7],\n              \'max_leaf_nodes\' : [10,15],\n              \'min_impurity_decrease\': [0.0001,0.001] }\n\n#Calling RandomizedSearchCV\nrandomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)\n\n#Fitting parameters in RandomizedSearchCV\nrandomized_cv.fit(X_train_un,y_train_un)\n\nprint("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))\n'

Gradient Boosting model with Original dataΒΆ

InΒ [158]:
%%time

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8124803767660911:
CPU times: total: 1.72 s
Wall time: 24.9 s
InΒ [159]:
tuned_gbm0 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=100,
    max_features=0.5,
    learning_rate=0.1,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm0.fit(X_train_un, y_train_un)
Out[159]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
AdaBoostClassifier(random_state=1)
AdaBoostClassifier(random_state=1)
InΒ [160]:
# Checking model's performance on training set
gbm0_train = model_performance_classification_sklearn(
    tuned_gbm0, X_train, y_train
)
gbm0_train
Out[160]:
Accuracy Recall Precision F1
0 0.949 0.984 0.766 0.861
InΒ [161]:
# Checking model's performance on validation set
gbm0_val = model_performance_classification_sklearn(tuned_gbm0, X_val, y_val)
gbm0_val
Out[161]:
Accuracy Recall Precision F1
0 0.943 0.966 0.750 0.845

Gradient Boosting model with Oversampled dataΒΆ

InΒ [163]:
%%time

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8124803767660911:
CPU times: total: 1.95 s
Wall time: 13.8 s
InΒ [164]:
tuned_gbm1 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=100,
    max_features=0.5,
    learning_rate=.1,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_over, y_train_over)
Out[164]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
AdaBoostClassifier(random_state=1)
AdaBoostClassifier(random_state=1)
InΒ [165]:
# Checking model's performance on training set
gbm1_train = model_performance_classification_sklearn(tuned_gbm1, X_train_over, y_train_over)
gbm1_train
Out[165]:
Accuracy Recall Precision F1
0 0.980 0.981 0.978 0.980
InΒ [166]:
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val
Out[166]:
Accuracy Recall Precision F1
0 0.960 0.893 0.864 0.878

Gradient Boosting model with Undersampled dataΒΆ

InΒ [168]:
%%time

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9508320251177395:
CPU times: total: 734 ms
Wall time: 5.22 s
InΒ [169]:
tuned_gbm2 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=75,
    max_features=0.7,
    learning_rate=0.1,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train_un, y_train_un)
Out[169]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
AdaBoostClassifier(random_state=1)
AdaBoostClassifier(random_state=1)
InΒ [170]:
# Checking model's performance on training set
gbm2_train = model_performance_classification_sklearn(
    tuned_gbm2, X_train_un, y_train_un
)
gbm2_train
Out[170]:
Accuracy Recall Precision F1
0 0.971 0.977 0.966 0.971
InΒ [171]:
# Checking model's performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val
Out[171]:
Accuracy Recall Precision F1
0 0.936 0.954 0.732 0.828

AdaBoost model with Undersampled dataΒΆ

InΒ [173]:
%%time

# defining model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(10, 40, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.9375039246467818:
CPU times: total: 438 ms
Wall time: 2.23 s
InΒ [174]:
tuned_adb = AdaBoostClassifier(
    random_state=1,
    n_estimators=30,
    learning_rate=1,
    base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
tuned_adb.fit(X_train_un, y_train_un)
Out[174]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   learning_rate=1, n_estimators=30, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                         random_state=1),
                   learning_rate=1, n_estimators=30, random_state=1)
DecisionTreeClassifier(max_depth=2, random_state=1)
DecisionTreeClassifier(max_depth=2, random_state=1)
InΒ [175]:
# Checking model's performance on training set
adb_train = model_performance_classification_sklearn(tuned_adb, X_train_un, y_train_un)
adb_train
Out[175]:
Accuracy Recall Precision F1
0 0.970 0.975 0.965 0.970
InΒ [176]:
# Checking model's performance on validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val
Out[176]:
Accuracy Recall Precision F1
0 0.932 0.966 0.714 0.821

Model Comparison and Final Model SelectionΒΆ

InΒ [178]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        gbm0_train.T,
        gbm1_train.T,
        gbm2_train.T,
        adb_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Original data",
    "Gradient boosting trained with Oversampled data",
    "Gradient boosting trained with Undersampled data",
    "AdaBoost trained with Undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[178]:
Gradient boosting trained with Original data Gradient boosting trained with Oversampled data Gradient boosting trained with Undersampled data AdaBoost trained with Undersampled data
Accuracy 0.949 0.980 0.971 0.970
Recall 0.984 0.981 0.977 0.975
Precision 0.766 0.978 0.966 0.965
F1 0.861 0.980 0.971 0.970
InΒ [179]:
# Validation performance comparison

models_train_comp_df = pd.concat(
    [ gbm0_val.T, gbm1_val.T, gbm2_val.T, adb_val.T], axis=1,
)
models_train_comp_df.columns = [
    "Gradient boosting trained with Original data",
    "Gradient boosting trained with Oversampled data",
    "Gradient boosting trained with Undersampled data",
    "AdaBoost trained with Undersampled data",
]
print("Validation performance comparison:")
models_train_comp_df
Validation performance comparison:
Out[179]:
Gradient boosting trained with Original data Gradient boosting trained with Oversampled data Gradient boosting trained with Undersampled data AdaBoost trained with Undersampled data
Accuracy 0.943 0.960 0.936 0.932
Recall 0.966 0.893 0.954 0.966
Precision 0.750 0.864 0.732 0.714
F1 0.845 0.878 0.828 0.821

** Observations**

  • Gradient boosting trained with Original data and AdaBoost trained with Undersampled data had similiar Recall values
  • Gradient boosting trained with Original data had slightly better accuracy

Test set final performanceΒΆ

InΒ [182]:
# Let's check `Gradient boosting trained with Original data` on the test set
gbm0_test = model_performance_classification_sklearn(tuned_gbm0, X_test, y_test)
gbm0_test
Out[182]:
Accuracy Recall Precision F1
0 0.938 0.969 0.731 0.833
InΒ [183]:
confusion_matrix_sklearn(tuned_gbm0, X_test, y_test)
No description has been provided for this image

Let's check both Gradient boosting trained with Original data and AdaBoost trained with Undersampled data against the test data

InΒ [185]:
# Let's check the AdaBoost trained with Undersampled data performance on test set
ada_test = model_performance_classification_sklearn(tuned_adb, X_test, y_test)
ada_test
Out[185]:
Accuracy Recall Precision F1
0 0.925 0.957 0.693 0.804
InΒ [186]:
confusion_matrix_sklearn(tuned_adb, X_test, y_test)
No description has been provided for this image

Observation

  • Gradient boosting trained with Original data performed best against the test data.
  • This performance is in line with what we achieved with this model on the train and validation sets
  • Gradient boosting trained with Original data is a generalized model

Feature importanceΒΆ

InΒ [189]:
feature_names = X_train.columns
importances = tuned_gbm0.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
No description has been provided for this image

Observations

Total_Trans_Ct, Total_Trans_Amt, Total_Revolving_Bal, Total_Ct_Chng_Q4_Q1, Total_Amt_Chng_Q4_Q1 were the most significant features for making predictions

Business Insights and ConclusionsΒΆ

Observations

By reviewing the model, we can learn that the following factors are important in credit card attrition:

  • The best predictors of whether a customer continues with credit cards is their current usage, transaction amounts, balance.
  • The second big indicators are the change in usage both in terms of transaction amounts and transaction counts.
  • The third highest indicators are the utilization and the number of times the customer contacts or meets with the bank.

Recommendations

While taking risk to bank into account, the bank could become aggressive with:

  • Balance transfer incentives may lead to higher utlization and higher revolving balance, both key indicators for attrition.
  • Cash back incentives for card usage
  • Provide incentives for the customer to add another product such as personal loans - even if it isn't a credit card. The idea is deepen the customer relations metric and cross sells services.
  • Availability and preferred card status at online providers, such as SHOP or PAYPAL or others to increase transaction count.
  • Many credit cards also provide "Pay in 4" and charge the credit card over time as incentives to increase amount.

The economy, interest rate changes were not included in the study and may be significant sources. This study does not include reasons or insights into why customers might no longer want our bank's credit cards.